AI Detection Processing Sizing and Scalability Guide
This guide provides hardware specifications, software requirements, and sizing and scalability guidance for the AI detection tier of an on-premises VIDIZMO deployment. It applies to any deployment that runs AI detection within your own data center.
Note: This guide covers batch, on-demand AI detection over uploaded media (video, audio, documents, and images), and the throughput figures here are faster-than-real-time batch rates. The guide does not cover sizing for continuous live-surveillance camera AI (such as Face and Object Recognition for Live Surveillance), which is a different per-camera, continuous-inference workload and is sized separately.
Overview
AI detection and analysis run on the AI Server, the GPU-bound tier that performs inference. The workloads in scope are object detection, OCR, PII detection, and transcription, across video, audio, document, and image media.
In many deployments the user and viewer population is small relative to the volume of content, so sizing is governed by how much content you push through AI detection, not by concurrent users or license counts. This guide focuses on the AI Server tier for that reason.
The sizing figures are grounded in benchmark testing on NVIDIA RTX 5090 and RTX 6000 Blackwell (RTX PRO 6000 Blackwell) GPUs, with detection throughput measured for all four media types.
How AI Detection Fits Into the Deployment
VIDIZMO uses a distributed, event-driven architecture. Detection jobs are dispatched asynchronously by the content-processing (workflow) service, which queues and parallelizes work and routes it to available AI Server instances through service discovery. This is what lets the AI tier scale by adding servers rather than by growing a single host indefinitely.
The AI Server runs the inference services for object detection, OCR, PII detection, and transcription. Its only runtime dependencies are the service-discovery and coordination component and the content-processing service that dispatches jobs and collects results. The AI Server does not need access to the application, database, or caching and broker servers.
Each detection job runs one or more of these services in sequence. For example, an image may pass through object detection, then OCR, then PII detection. When jobs are pipelined across a batch, sustained throughput is governed by the slowest stage in the chain.
Workload Characterization
Each detection function stresses the hardware differently. The table below maps the in-scope workloads to media types and the resource that limits them.
| Workload | Applies to | Dominant bottleneck |
|---|---|---|
| Object detection | Video, image | GPU (CUDA) |
| OCR | Document, image, video frames | Mixed CPU and GPU |
| PII detection (visual: faces, plates) | Video, image | GPU |
| PII detection (textual) | Document, transcribed audio | CPU (NLP models) |
| Transcription | Audio, video | GPU |
Detection is fast and scales predictably. Document and image detection are governed by page and item counts, audio detection runs well above real time, and video detection runs at roughly twice real time. No stage carries a large per-item cost, so detection capacity scales cleanly as you add servers.
Video is the heaviest detection workload per hour of content. Because every frame is scanned for objects and PII, video detection sustains only about 1.9× real time, while the same server clears tens of thousands of document pages in an hour. Denser footage, with more objects per frame, runs at the slower end of the measured range.
Benchmark Results: Detection Throughput
Measured on NVIDIA RTX 5090 and RTX PRO 6000 Blackwell GPUs (single AI host), at batch sizes up to 40 to 50.
Per-stage rates:
| Stage | Video | Audio | Document | Image |
|---|---|---|---|---|
| Object detection | 2.9× real time | N/A | ~24,000 pages/hr | ~1,800 img/hr |
| OCR | 2.8× real time | N/A | ~24,400 pages/hr | ~1,985 img/hr |
| Transcription | N/A | 35.8× real time | N/A | N/A |
| PII (audio includes transcription) | 1.9× real time | 10.9× real time | ~22,600 pages/hr | ~1,350 img/hr |
Combined throughput per AI server, sustained and governed by the slowest stage in each media's chain:
| Media | Detection chain | Sustained throughput (1 server) |
|---|---|---|
| Video | Object, OCR, PII | ~1.9× real time |
| Audio | Transcription, PII | ~10.9× real time |
| Document | Object, OCR, PII | ~22,600 pages/hour |
| Image | Object, OCR, PII | ~1,350 images/hour |
Detection rates are fast and consistent. Documents detect at roughly 22,000 to 24,000 pages per hour, and audio transcribes at nearly 36× real time. Plan some headroom for retries and the occasional reprocessed item.
Note: These combined figures are best-case pipelined rates that assume stages overlap across items. On a single GPU, where object detection, visual PII, and transcription all share the same hardware, sustained throughput can run somewhat below the slowest-stage figure. The 80% utilization basis used in the capacity table below builds in margin for this, but size with that contention in mind when planning tightly.
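The slowest-stage rule behind the combined table can be sketched in a few lines of Python. The dictionary copies the per-stage rates from the benchmark table; the names `STAGE_RATES` and `sustained_rate` are illustrative only and are not part of any VIDIZMO API. The result is the same best-case pipelined figure, so the single-GPU contention caveat still applies.

```python
# Per-stage rates copied from the benchmark table in this guide.
# Video and audio are multiples of real time; documents are pages/hour
# and images are images/hour. Stages that do not apply are omitted.
STAGE_RATES = {
    "video":    {"object": 2.9, "ocr": 2.8, "pii": 1.9},
    "audio":    {"transcription": 35.8, "pii": 10.9},
    "document": {"object": 24_000, "ocr": 24_400, "pii": 22_600},
    "image":    {"object": 1_800, "ocr": 1_985, "pii": 1_350},
}


def sustained_rate(media: str) -> tuple[str, float]:
    """Return (limiting stage, rate): in a pipelined chain the slowest stage governs."""
    stages = STAGE_RATES[media]
    slowest = min(stages, key=stages.get)
    return slowest, stages[slowest]


for media in STAGE_RATES:
    stage, rate = sustained_rate(media)
    print(f"{media:<9} limited by {stage:<13} at {rate:g}")
```

In every chain measured here the PII stage is the limiting one, which is why the combined figures match the PII row of the per-stage table.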
What One AI Server Can Handle
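Based on the combined rates above, the table below shows the capacity of a single AI server running continuously at about 80% utilization. That is roughly 19 processing-hours per day, or about 584 per month, which leaves headroom for spikes and retries. The per-hour column is the sustained rate itself, and the per-day and per-month columns apply the 80% basis.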
| Media and detection chain | Per hour | Per day (24×7 @ ~80%) | Per month |
|---|---|---|---|
| Video (object, OCR, PII) | ~1.9 hrs footage | ~36 hrs footage | ~1,110 hrs footage |
| Audio (transcription, PII) | ~11 hrs audio | ~209 hrs audio | ~6,400 hrs audio |
| Documents (object, OCR, PII) | ~22,600 pages | ~434,000 pages | ~13.2 million pages |
| Images (object, OCR, PII) | ~1,350 images | ~25,900 images | ~788,000 images |
Note: All figures cover AI detection only (object detection, OCR, PII detection, and transcription). Video is by far the heaviest workload: a single server processes only a couple of hours of footage per hour of operation, while the same server clears millions of document pages or hundreds of thousands of images per month. Denser video footage lands at the lower end of the video range.
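The capacity table is straightforward arithmetic on the sustained rates, and reproducing it makes it easy to substitute your own utilization target. The sketch below is illustrative: the constant and function names are assumptions, and the 584-hour month is taken directly from this guide rather than derived.

```python
UTILIZATION = 0.80                 # planning basis used in this guide
HOURS_PER_DAY = 24 * UTILIZATION   # about 19.2 processing-hours per day
HOURS_PER_MONTH = 584              # monthly figure used in this guide

# Sustained per-server rates from the combined-throughput table
# (units of content processed per hour of server time).
SUSTAINED_RATE = {
    "video_hours": 1.9,
    "audio_hours": 10.9,
    "document_pages": 22_600,
    "images": 1_350,
}


def capacity(rate_per_hour: float) -> dict[str, float]:
    """Scale an hourly rate to daily and monthly capacity for one server."""
    return {
        "per_day": rate_per_hour * HOURS_PER_DAY,
        "per_month": rate_per_hour * HOURS_PER_MONTH,
    }


for unit, rate in SUSTAINED_RATE.items():
    c = capacity(rate)
    print(f"{unit:<15} {c['per_day']:>12,.0f} /day {c['per_month']:>14,.0f} /month")
```

Lowering `UTILIZATION` models a server that is shared with other work or reserved for bursts. In that case `HOURS_PER_MONTH` should be scaled by the same factor.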
Hardware Specifications for the AI Server
These are the validated V12 AI Server baselines. The minimum is a proof-of-concept floor. The recommended column is the production starting point for an active detection workload.
| Component | Minimum (proof of concept) | Recommended (production) |
|---|---|---|
| CPU | 4 cores | 8 cores |
| RAM | 16 GB | 32 GB or more |
| GPU | NVIDIA GPU with CUDA support (Tesla T4, RTX 4090, or A-series) | NVIDIA GPU with CUDA support (RTX 5090 or RTX PRO 6000 Blackwell) |
| Disk | 500 GB SSD | 1 TB SSD or NVMe (over 500 MB/s) |
| Network | 10 Gbps | 10 Gbps |
The GPU is the throughput-determining component and is mandatory for this workload mix, since object detection, visual PII, and transcription are all CUDA workloads. In benchmarking, a single host sustained the detection rates above at batch sizes up to 40 to 50, which shows the GPU handled high concurrency without collapsing. For detection, CUDA compute capability and VRAM are the throughput drivers. Video-encode (NVENC) capability is not used by detection.
GPU Selection Guidance
AI detection is CUDA-bound inference. The two factors that determine throughput are CUDA compute performance and VRAM. VRAM determines how many models can stay resident and how large a batch one server can run. For context, the RTX PRO 6000 Blackwell uses the same silicon as the consumer RTX 5090 but is validated for professional use and carries 96 GB of GDDR7 ECC memory, triple the RTX 5090's 32 GB.
| Category | Representative cards | VRAM | Fit for detection |
|---|---|---|---|
| Consumer RTX (40 and 50 series) | RTX 4090 (24 GB), RTX 5090 (32 GB) | 12 to 32 GB | Strong price-to-performance, no ECC, consumer driver licensing |
| Professional workstation | RTX A6000 or 6000 Ada (48 GB), RTX PRO 6000 Blackwell (96 GB), RTX PRO 5000 Blackwell (72 GB) | 16 to 96 GB | Best balance: ECC, certified drivers, high VRAM |
| Datacenter, inference | NVIDIA L4 (24 GB), L40S (48 GB) | 24 to 48 GB | Excellent rack-dense inference |
| Datacenter, compute | A100 (40 or 80 GB), H100 (80 GB), B-series | 40 GB and up | Excellent for detection (high compute and VRAM) |
Consumer RTX 40 and 50 series offer the strongest price-to-performance for the AI Server's inference role and handled batch-40 detection comfortably in testing. The trade-offs against professional cards are no ECC memory, consumer driver and licensing terms (review NVIDIA's data-center driver licensing before deploying consumer cards in a server), lower physical density, and less validation for continuous duty.
Professional workstation cards are the recommended production choice for their ECC memory, certified drivers for sustained operation, and high VRAM (48 GB on the RTX A6000 and RTX 6000 Ada, 72 to 96 GB on the Blackwell generation). For a single server expected to hold the full model set and run at high concurrency, the higher-VRAM Blackwell workstation cards are the strongest fit. The RTX PRO 6000 Blackwell has no native NVLink, so VRAM cannot be pooled across cards. Scale by adding independent worker servers rather than expecting multi-card memory aggregation.
Datacenter GPUs are well suited to detection. The L4 and L40S provide rack-dense inference, and because the AI tier performs detection only (no video encoding), the flagship compute GPUs (A100, H100, and the Blackwell B-series) are also excellent choices. Their large VRAM and high CUDA performance directly accelerate detection and transcription.
Software Requirements
The AI Server supports two deployment paths. Choose based on your data-center operating system standards.
Windows path
- Windows Server 2019 or 2022 (Standard, Enterprise, or Datacenter)
- Microsoft .NET Framework 4.8 and the .NET Core Hosting Bundle version 8 or later
- Python 3.9.13
Linux path (containerized)
- Linux with Docker support (Ubuntu LTS or RHEL recommended)
- Docker runtime, NVIDIA Container Toolkit, and NVIDIA GPU drivers
The AI Server needs network connectivity to two things only: the service-discovery and coordination component, and the content-processing service that dispatches detection jobs and collects results. It does not need access to the caching and broker server. Confirm the exact ports with VIDIZMO during deployment planning. For licensing, allow outbound HTTPS (port 443) to the VIDIZMO licensing endpoint, and note that an air-gapped environment may need a proxy exception for it.
How Many AI Servers Do You Need
Start With This Idea
There are two separate questions, and they are easy to mix up:
- How much storage do I need? This is driven by your total data volume, for example 100 TB per month. It sizes your disks and storage system.
- How many AI servers do I need? This is driven by how much of that content you run through AI detection, measured in hours and pages, not terabytes.
Think of it like a records office. The size of the filing cabinets depends on how many documents you store (storage). How many clerks you need depends on how many files must be read and analyzed, and how long each one takes (AI servers).
Why Storage Size Does Not Tell You the AI Workload
Detection time depends on the length or count of the content, and different media pack very differently into a terabyte.
| 1 TB of | Is roughly | AI detection time |
|---|---|---|
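| Video | ~500 hours of footage | Slowest per hour of content (~1.9× real time) |
| Audio | ~10,000 hours of recordings | Fast per hour (~11× real time), but a terabyte holds about 20 times as many hours as video |
| Documents | ~5 million pages | Very fast per page |
| Images | ~333,000 images | Fast per image |
Terabytes alone say little about AI work, because a terabyte of each media type holds a very different number of hours or items. Video is the most demanding per hour of content, because every frame is scanned, while audio runs about 11× real time and documents and images are very fast per item. Density matters as much as speed: at the measured rates, a terabyte of audio (about 10,000 hours) takes roughly 920 server-hours to detect, compared with roughly 260 for a terabyte of video (about 500 hours).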
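To see how a terabyte translates into detection work, the sketch below converts 1 TB of each media type into hours or items and then into server-hours. It uses the approximate per-unit sizes from the planning assumptions in this guide and the sustained per-server rates. Decimal terabytes (1 TB = 1,000,000 MB) and all names are assumptions made for illustration.

```python
# Approximate size per unit in MB, from this guide's planning assumptions.
MB_PER_UNIT = {
    "video_hours": 2_000,    # about 2 GB per hour of video
    "audio_hours": 100,      # about 0.1 GB per hour of audio
    "document_pages": 0.2,   # about 0.2 MB per page
    "images": 3,             # about 3 MB per image
}

# Sustained per-server detection rates (units per server-hour).
RATE_PER_SERVER_HOUR = {
    "video_hours": 1.9,
    "audio_hours": 10.9,
    "document_pages": 22_600,
    "images": 1_350,
}

MB_PER_TB = 1_000_000  # assumption: decimal terabytes


def terabyte_profile(media: str) -> tuple[float, float]:
    """Return (units held in 1 TB, server-hours needed to detect them)."""
    units = MB_PER_TB / MB_PER_UNIT[media]
    return units, units / RATE_PER_SERVER_HOUR[media]


for media in MB_PER_UNIT:
    units, server_hours = terabyte_profile(media)
    print(f"1 TB of {media:<15} ~{units:>12,.0f} units  ~{server_hours:>6,.0f} server-hours")
```

The per-unit sizes vary widely with resolution, bitrate, and scan quality, so replace them with values sampled from your own content before relying on the output.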
A Simple Four-Step Method
- Decide how much of your content needs AI. You may not need to analyze everything you store, often just the content that has to be searchable, redacted, or compliant. This percentage is the single biggest factor.
- Break that content into hours and counts. Estimate hours of video and audio, and pages and images per month, that will go through detection.
- Apply the measured detection speeds per server. Video about 1.9× real time, audio about 11× real time, documents about 22,600 pages per hour, and images about 1,350 per hour.
- Divide by what one server delivers in your processing window. A server delivers about 584 useful processing-hours per month running around the clock, and less if you only process overnight.
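The four steps above can be sketched as a small calculator. The rates and the 584-hour month come from this guide. The function name, its parameters, and the example volumes are hypothetical and stand in for your own estimates.

```python
import math

HOURS_PER_SERVER_MONTH = 584  # 24x7 at about 80% utilization, per this guide

# Sustained per-server detection rates (units per server-hour).
RATE_PER_SERVER_HOUR = {
    "video_hours": 1.9,
    "audio_hours": 10.9,
    "document_pages": 22_600,
    "images": 1_350,
}


def servers_needed(stored_per_month, detection_share,
                   window_hours=HOURS_PER_SERVER_MONTH):
    """Return (server-hours of detection work per month, servers required)."""
    # Steps 1 and 2: keep only the share of content that needs AI,
    # expressed in hours and counts rather than terabytes.
    to_detect = {m: v * detection_share for m, v in stored_per_month.items()}
    # Step 3: convert each volume to server-hours at the measured rate.
    server_hours = sum(v / RATE_PER_SERVER_HOUR[m] for m, v in to_detect.items())
    # Step 4: divide by what one server delivers in the processing window.
    return server_hours, math.ceil(server_hours / window_hours)


# Hypothetical monthly intake, with half of it sent through detection.
example = {
    "video_hours": 4_000,
    "audio_hours": 10_000,
    "document_pages": 6_000_000,
    "images": 1_000_000,
}
hours, servers = servers_needed(example, detection_share=0.5)
print(f"{hours:,.0f} server-hours per month -> {servers} AI servers")
```

Passing a smaller `window_hours`, for example when detection only runs overnight, raises the server count for the same volume. Rounding up with `math.ceil` reflects that capacity is bought in whole servers.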
Operational Scaling Guidance
VIDIZMO scales the AI detection tier horizontally. The content-processing service dispatches jobs through the queue and service discovery locates available instances, so adding AI servers raises total throughput with no application reconfiguration. The queue simply drains across more workers.
- Scale GPU-dense AI servers for high-volume video and image detection, since those are the heaviest loads on the tier.
- Prefer larger-VRAM GPUs for concurrency, because more VRAM lets one server hold the full model set and run larger batches.
- Use the processing window deliberately. Letting content queue for an overnight batch, instead of requiring same-hour turnaround, can cut the number of required servers several-fold.
- Monitor shared-storage I/O as you add nodes. A single node is comfortable at 10 Gbps with NVMe, but multi-node scaling can raise shared-storage demand.
- Plan for high availability if detection is mission-critical. Distributed AI servers and SQL Server Always On provide redundancy, and you should plan backup and disaster recovery for the on-premises content store and configuration.
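The effect of the processing window on server count can be illustrated with a short calculation. This is a simplified model under stated assumptions: a single batch of video that must finish within the window, the guide's 1.9× video rate, and the 80% utilization basis. The function name and the 40-hour batch are hypothetical.

```python
import math

VIDEO_RATE = 1.9     # hours of footage per server-hour, from this guide
UTILIZATION = 0.80   # planning margin used in this guide


def servers_for_window(footage_hours: float, window_hours: float) -> int:
    """Servers needed to clear a batch of footage within the given window."""
    per_server = VIDEO_RATE * window_hours * UTILIZATION
    return math.ceil(footage_hours / per_server)


# Hypothetical batch: 40 hours of footage arriving each day.
for window in (1, 8, 24):
    print(f"{window:>2}-hour window: {servers_for_window(40, window)} servers")
```

The model ignores queueing delays and shared-storage limits, so treat it as a first estimate to refine during a pilot.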
Summary and Recommendations
For a deployment with a small user base but full AI detection across all media, concentrate investment in GPU-equipped AI Servers, scaled horizontally against measured throughput and your agreed processing window.
- Documents and images are very fast to detect, at about 22,600 pages per hour and 1,350 images per hour per server.
- Audio detection runs about 11× real time, and transcription alone runs near 36× real time.
- Video is the cost driver at about 1.9× real time and should anchor the sizing model. Denser footage runs slower.
- Equip the AI Server with a 1 TB SSD or NVMe disk and a CUDA-capable NVIDIA GPU. For detection, CUDA compute and VRAM are the throughput drivers.
- Begin with a small pilot on real content to confirm your content mix and detection percentage, then add AI nodes as volume grows or service levels tighten.
Planning Assumptions and Notes
Key assumptions used in the sizing method:
- Bytes per unit: video about 2 GB per hour, audio about 0.1 GB per hour, image about 3 MB, document about 0.2 MB per page.
- Server availability: about 584 processing-hours per month (24×7 at about 80% utilization).
- Benchmark data was collected on a single AI host using multiple GPU models, including the RTX 5090 and RTX PRO 6000 Blackwell.
- GPU specifications are drawn from NVIDIA product documentation and developer references. GPU lineup details are current as of mid-2026 and should be re-verified at procurement time.
See Also
- Design and Architecture Overview
- Deploy VIDIZMO - On-premises and Dedicated Cloud Environments
- Prerequisites for VIDIZMO Application